## 2.6.4. Address registers and address computation

The address of a memory location is the number of consecutive <u>bytes</u> between the beginning of the RAM memory and the beginning of that memory location.

An uninterrupted sequence of memory locations, used for similar purposes during program execution, represents a *segment*. So, a segment is a logical section of a program's memory, characterized by its *base address* (beginning), its *limit* (size) and its *type*. Both the base address and the segment size are represented as 32-bit values.

In the family of 8086-based processors, the term **segment** has two meanings:

1. A block of memory of discrete size, called a *physical segment*. The number of bytes in a physical memory segment is
   - (a) 64K for 16-bit processors;
   - (b) 4 gigabytes for 32-bit processors.
2. A variable-sized block of memory, called a *logical segment*, occupied by a program's code or data.

We will call *offset* the address of a location relative to the beginning of a segment or, in other words, the number of bytes between the beginning of that segment and that particular memory location. An offset is valid only if its numerical value, on 32 bits, doesn't exceed the limit of the segment it refers to.

We will call *address specification* a pair of a *segment selector* and an *offset*. A <u>segment selector</u> is a 16-bit numeric value which uniquely selects the accessed segment and its features. In hexadecimal, an address <u>specification</u> can be written as:

$$s_3s_2s_1s_0 : o_7o_6o_5o_4o_3o_2o_1o_0$$

In this case, the selector $s_3s_2s_1s_0$ designates a segment which has the base address $b_7b_6b_5b_4b_3b_2b_1b_0$ and the limit $l_7l_6l_5l_4l_3l_2l_1l_0$. The base and the limit are obtained by the processor after performing a segmentation process.

To be granted access to the specified location, the following condition must be satisfied:

$$o_7o_6o_5o_4o_3o_2o_1o_0 \le l_7l_6l_5l_4l_3l_2l_1l_0$$

Based on such a specification the actual segmentation address computation will be performed as:

$$a_7a_6a_5a_4a_3a_2a_1a_0 := b_7b_6b_5b_4b_3b_2b_1b_0 + o_7o_6o_5o_4o_3o_2o_1o_0$$

where $a_7a_6a_5a_4a_3a_2a_1a_0$ is the computed address (in hexadecimal form). The address obtained above is called a *linear* address (or segmentation address).

An address specification is also called a FAR address. When an address is specified only by its offset, we call it a NEAR address.

A concrete example of an address specification is: **8:1000h** 

To compute the linear address corresponding to this specification, the processor will do the following:

1. It checks whether the segment with selector value 8 was defined by the operating system and blocks the access if such a segment wasn't defined;
2. It extracts the base address (B) and the segment's limit (L); as a result we may have, for example, B = 2000h and L = 4000h;
3. It verifies whether the offset exceeds the segment's limit: 1000h > 4000h? If so, the access would be blocked;
4. It adds the offset to B and obtains the linear address 3000h (1000h + 2000h). This computation is performed by the **ADR** component from **BIU**.
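The four steps above can be sketched in a few lines of Python. This is only an illustration: the `DESCRIPTORS` dictionary, the `AccessBlocked` exception and the function name are hypothetical, and the base and limit for selector 8 are the example values B = 2000h and L = 4000h assumed above.

```python
# Hypothetical descriptor table: selector -> (base, limit).
# The entry for selector 8 uses the example values B = 2000h, L = 4000h.
DESCRIPTORS = {0x0008: (0x2000, 0x4000)}


class AccessBlocked(Exception):
    """Raised where the processor would block the memory access."""


def linear_address(selector, offset, descriptors=DESCRIPTORS):
    # Step 1: the segment must have been defined by the operating system.
    if selector not in descriptors:
        raise AccessBlocked(f"segment {selector:#x} is not defined")
    # Step 2: extract the base address (B) and the limit (L).
    base, limit = descriptors[selector]
    # Step 3: the offset must not exceed the segment's limit.
    if offset > limit:
        raise AccessBlocked(f"offset {offset:#x} exceeds limit {limit:#x}")
    # Step 4: linear address = base + offset, kept on 32 bits.
    return (base + offset) & 0xFFFFFFFF


print(hex(linear_address(0x0008, 0x1000)))  # 0x3000
```

Calling `linear_address(0x0008, 0x5000)` or using an undefined selector raises `AccessBlocked`, mirroring steps 1 and 3.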

This kind of addressing is called *segmentation* and we are talking about the *segmented addressing model*.

When the segments start from address 0 and have the maximum possible size (4 GiB), any offset is automatically valid and segmentation isn't practically involved in address computation. So, having $b_7b_6b_5b_4b_3b_2b_1b_0 = 00000000$, the address computation for the logical address $s_3s_2s_1s_0 : o_7o_6o_5o_4o_3o_2o_1o_0$ will result in the following linear address:

$$a_7a_6a_5a_4a_3a_2a_1a_0 := 00000000 + o_7o_6o_5o_4o_3o_2o_1o_0$$

$$a_7a_6a_5a_4a_3a_2a_1a_0 := o_7o_6o_5o_4o_3o_2o_1o_0$$

This particular way of using segmentation, employed by most modern operating systems, is called the *flat* memory model.
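As a small illustration of the flat model, the sketch below assumes a segment with base 0 and the maximum 32-bit limit; the constant and function names are hypothetical. With these values the limit check can never fail for a 32-bit offset, and the linear address is simply the offset.

```python
FLAT_BASE = 0x00000000
FLAT_LIMIT = 0xFFFFFFFF  # 4 GiB segment: every 32-bit offset is within the limit


def flat_linear_address(offset):
    # Only values that are not 32-bit offsets at all can be rejected here.
    if not 0 <= offset <= FLAT_LIMIT:
        raise ValueError("not a 32-bit offset")
    return (FLAT_BASE + offset) & 0xFFFFFFFF


for off in (0x0, 0x1000, 0xFFFFFFFF):
    # In the flat model the linear address always equals the offset.
    print(hex(off), "->", hex(flat_linear_address(off)))
```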

The x86 processors also have a memory access control mechanism called *paging*, which is independent of address segmentation. Paging implies dividing the *virtual* memory into *pages*, which are associated (translated) to the available physical memory.

The configuration and the control of segmentation and paging are performed by the operating system. Of these two, only segmentation affects address specification, paging being completely transparent to user programs.
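To make the idea of page translation more concrete, here is a deliberately simplified sketch. It assumes 4 KiB pages and a single-level, hypothetical `PAGE_TABLE` dictionary; the real x86 mechanism uses multi-level tables maintained by the operating system, which this example does not attempt to model.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages

# Hypothetical mapping: virtual page number -> physical frame number.
PAGE_TABLE = {0x00003: 0x0002A}


def physical_address(linear, page_table=PAGE_TABLE):
    # Split the linear address into a page number and an offset inside the page.
    page, offset_in_page = divmod(linear, PAGE_SIZE)
    if page not in page_table:
        raise LookupError(f"page {page:#x} is not mapped")
    # The offset inside the page is preserved; only the page number is translated.
    return page_table[page] * PAGE_SIZE + offset_in_page


print(hex(physical_address(0x3000)))  # 0x2a000
print(hex(physical_address(0x3ABC)))  # 0x2aabc
```

The user program only ever works with the linear address (0x3000 here); the translated address never appears in its address specifications.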

Both address computation and the use of segmentation and paging are influenced by the execution mode of the processor. The x86 processors support the following main execution modes:

- real mode, on 16 bits (using a 16-bit memory word and with memory limited to 1 MiB);
- protected mode, on 16 or 32 bits, characterized by the use of paging and segmentation;
- virtual 8086 mode, which allows running real mode programs together with protected mode ones;
- long mode, on 64 and 32 bits, where paging is mandatory while segmentation is deactivated.

In our course we will focus on the architecture and the behavior of x86 processors in 32-bit protected mode.

The x86 architecture allows 4 types of segments:

- code segment, which contains instructions;
- data segment, containing data which instructions work on;
- stack segment;
- extra segment (supplementary data segment).

Every program is composed of one or more segments of one or more of the types specified above. At any given moment during run time there is at most one <u>active</u> segment of each type. Registers **CS** (*Code Segment*), **DS** (*Data Segment*), **SS** (*Stack Segment*) and **ES** (*Extra Segment*) from **BIU** contain the selector values of the active segments, one for each type. So registers CS, DS, SS and ES determine the starting addresses and the sizes of the 4 active segments: code, data, stack and extra segments. Registers FS and GS can store selectors pointing to other auxiliary segments without a predetermined meaning. Because of their use, CS, DS, SS, ES, FS and GS are called *segment registers* (or *selector registers*). Register **EIP** (whose less significant word can also be accessed through the **IP** subregister) contains the offset of the current instruction inside the current code segment, this register being managed exclusively by **BIU**.

Because addressing is fundamental for understanding the functioning of the x86 processor and assembly programming, we review its concepts to clarify them:

| Notion                                              | Representation                             | Description                                                                                                                                        |
|-----------------------------------------------------|--------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------|
| Address specification, logical address, FAR address | Selector<sub>16</sub>:offset<sub>32</sub>  | Completely defines both the segment and the offset inside it.                                                                                      |
| Selector                                            | 16 bits                                    | Identifies one of the available segments. As a numeric value it encodes the position of the selected segment descriptor within a descriptor table. |
| Offset, NEAR address                                | Offset<sub>32</sub>                        | Defines only the offset component (considering that the segment is known or that the flat memory model is used).                                   |
| Linear address (segmentation address)               | 32 bits                                    | Segment beginning + offset; represents the result of the segmentation computation.                                                                 |
| Physical effective address                          | At least 32 bits                           | Final result of segmentation, possibly followed by paging. The final address obtained by BIU points to physical memory (hardware).                 |

## 2.6.5. Machine instructions representation

An x86 machine instruction is a sequence of 1 to 15 bytes whose values specify an operation to be run, the operands to which it will be applied and possible supplementary modifiers.

An x86 machine instruction has at most 2 operands. For most instructions, they are called *source* and *destination* respectively. Of these two operands, only one may be stored in the <u>RAM memory</u>. The other one must be either an <u>EU register</u> or an <u>integer constant</u>. Therefore, an instruction has the general form:

#### instruction\_name destination, source

The internal format of an instruction varies between 1 and 15 bytes, and has the following general form (*Instructions byte-codes from OllyDbg*):

#### [prefixes] + code + [ModeR/M] + [SIB] + [displacement] + [immediate]

The *prefixes* control how an instruction is executed. They are optional (0 to at most 4) and occupy one byte each. For example, they may request repetitive execution of the current instruction or may lock the bus during execution so as not to allow concurrent access to operands and results.

The operation to be run is identified by 1 to 2 bytes of *code* (opcode), which are the only mandatory bytes, regardless of the instruction. For some instructions, the *ModeR/M* byte (register/memory mode) specifies the nature and the exact storage of the operands (register, memory, constant etc.). It allows the specification of a register or of a memory location described by an offset.



Although the diagram seems to imply that instructions can be up to 16 bytes long, in actuality the x86 will not allow instructions greater than 15 bytes in length.

For addressing cases more complex than those implemented directly by ModeR/M, combining it with the SIB byte allows the following formula for an offset:

$$[base] + [index \times scale] + [constant]$$

where the values of two registers are used for base and index, and the scale is 1, 2, 4 or 8. The registers allowed as base and/or as index are: EAX, EBX, ECX, EDX, EBP, ESI, EDI. The ESP register is available as base but cannot be used as index.

Most instructions are encoded using either only the opcode or the opcode followed by ModeR/M.

The *displacement* is present in some particular addressing forms and it comes immediately after ModeR/M or SIB, if SIB is present. This field can be encoded either on a byte or on a doubleword (32 bits).

Because at most one ModeR/M, one SIB and one displacement field can appear in an instruction, the x86 architecture doesn't allow encoding two memory addresses in the same instruction.

With the *immediate value* we can define an operand as a numeric constant on 1, 2 or 4 bytes. When present, this field always appears at the end of the instruction.
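The field sizes described in this section can be added up to see why the format alone suggests 16 bytes while the processor accepts at most 15. The sketch below is only a bookkeeping aid: the dictionary and function names are hypothetical, and the maximum sizes are the ones stated in the text above.

```python
# Maximum size in bytes of each field, as described in this section.
MAX_FIELD_BYTES = {
    "prefixes": 4,      # 0 to 4 prefixes, one byte each
    "code": 2,          # 1 to 2 opcode bytes (mandatory)
    "ModeR/M": 1,
    "SIB": 1,
    "displacement": 4,  # a byte or a doubleword
    "immediate": 4,     # 1, 2 or 4 bytes
}
MAX_INSTRUCTION_LENGTH = 15


def instruction_length(fields):
    """Add up the field sizes and reject lengths the processor does not allow."""
    if fields.get("code", 0) < 1:
        raise ValueError("the opcode is mandatory")
    for name, size in fields.items():
        if size > MAX_FIELD_BYTES[name]:
            raise ValueError(f"field {name} is too long: {size} bytes")
    total = sum(fields.values())
    if total > MAX_INSTRUCTION_LENGTH:
        raise ValueError(f"{total} bytes exceeds the 15-byte limit")
    return total


print(sum(MAX_FIELD_BYTES.values()))    # 16: what the format alone suggests
print(instruction_length({"code": 1}))  # 1: opcode only
print(instruction_length({"code": 1, "ModeR/M": 1, "SIB": 1,
                          "displacement": 4, "immediate": 4}))  # 11
```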

## 2.6.6. FAR addresses and NEAR addresses.

To address a RAM memory location two values are needed: one to indicate the segment and another one to indicate the offset inside that segment. To simplify memory references, the microprocessor implicitly chooses, in the absence of any other specification, the segment's address from one of the segment registers CS, DS, SS or ES. The implicit choice of a segment register follows rules specific to the instruction being used.

An address for which only the offset is specified, the segment address being implicitly taken from a segment register, is called a *NEAR address*. A NEAR address is always inside one of the 4 active segments.
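The text does not list the implicit selection rules, so the sketch below is an assumption based on the commonly documented x86 defaults: instruction fetches use CS, memory operands whose base register is EBP or ESP use SS, and other data references use DS. It ignores special cases such as string instructions, and the function name is hypothetical.

```python
def default_segment_register(base=None, is_instruction_fetch=False):
    # Assumed default rules; string instructions and other
    # special cases are deliberately left out of this sketch.
    if is_instruction_fetch:
        return "CS"  # instructions are fetched from the code segment
    if base in ("EBP", "ESP"):
        return "SS"  # stack-relative references use the stack segment
    return "DS"      # ordinary data references use the data segment


print(default_segment_register(base="EBX"))                 # DS
print(default_segment_register(base="EBP"))                 # SS
print(default_segment_register(is_instruction_fetch=True))  # CS
```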

An address for which the programmer <u>explicitly specifies a segment selector</u> is called a *FAR address*. So, a FAR address is a COMPLETE ADDRESS SPECIFICATION and it may be specified in one of the following 3 ways:

- $s_3s_2s_1s_0$ : offset\_specification where  $s_3s_2s_1s_0$  is a constant;
- segment register: offset\_specification, where segment registers are CS, DS, SS, ES, FS or GS;
- FAR [variable], where variable is of type QWORD and holds, in its first 6 bytes, the FAR address.

The internal format of a FAR address is: <u>at the smallest address is the offset, and at the higher (by 4 bytes) address</u> (the word following the current doubleword) <u>is the word which stores the segment selector.</u>

The address representation follows the little-endian representation presented in Chapter 1, paragraph 1.3.2.3: the less significant part has the smallest address, and the most significant one has the higher address.
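The 6-byte layout described above can be reproduced with Python's standard `struct` module. This is a minimal sketch: the function names are hypothetical, and the selector and offset reuse the earlier example 8:1000h.

```python
import struct


def pack_far_address(selector, offset):
    # "<" = little-endian without padding, "I" = 32-bit offset at the
    # smallest address, "H" = 16-bit selector 4 bytes higher.
    return struct.pack("<IH", offset, selector)


def unpack_far_address(raw):
    offset, selector = struct.unpack("<IH", raw[:6])
    return selector, offset


raw = pack_far_address(0x0008, 0x00001000)
print(raw.hex(" "))             # 00 10 00 00 08 00
print(unpack_far_address(raw))  # (8, 4096)
```

The first four bytes hold the offset with its least significant byte first, followed by the two bytes of the selector.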

## 2.6.7. Computing the offset of an operand. Addressing modes.

For an instruction there are 3 ways to express a required operand:

- *register mode*, if the required operand is a register;
- *immediate mode*, when we use directly the operand's value (neither its address nor a register holding it);
- *memory addressing mode*, if the operand is located somewhere in memory. In this case, its offset is computed using the following formula:

$$offset\_address = [base] + [index \times scale] + [constant]$$

So *offset\_address* is obtained from the following (maximum) four elements:

- the content of one of the registers EAX, EBX, ECX, EDX, EBP, ESI, EDI or ESP as base;
- the content of one of the registers EAX, EBX, ECX, EDX, EBP, ESI or EDI as index;
- a scale factor of 1, 2, 4 or 8, by which the value of the index register is multiplied;
- the value of a numeric constant, on a byte or on a doubleword.
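The four elements listed above combine as base + index × scale + constant. The following sketch computes that sum on 32 bits; the function name and the register values in the example are hypothetical.

```python
def effective_offset(base=0, index=0, scale=1, constant=0):
    # Only the scale factors 1, 2, 4 and 8 can be encoded.
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4 or 8")
    # The result is an offset on 32 bits, so it wraps around modulo 2**32.
    return (base + index * scale + constant) & 0xFFFFFFFF


# Example: base register holds 1000h, index register holds 3,
# scale is 4 and the constant is 8.
print(hex(effective_offset(base=0x1000, index=3, scale=4, constant=8)))  # 0x1014
# A negative constant moves the offset backwards.
print(hex(effective_offset(base=0x10, constant=-4)))  # 0xc
```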

From here results the following modes to address the memory:

- *direct addressing*, when only the *constant* is present;
- *based addressing*, if one of the base registers is present in the computation;
- *scale-indexed addressing*, if one of the index registers is present in the computation.

These three addressing modes can be combined. For example, we can have direct based addressing, based and scale-indexed addressing etc.
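One way to read this classification is to look at which components of the offset formula are present. The sketch below names the resulting combination; the function and its output strings are hypothetical labels, and the register lists are the ones given earlier for base and index.

```python
BASE_REGISTERS = ("EAX", "EBX", "ECX", "EDX", "EBP", "ESI", "EDI", "ESP")
INDEX_REGISTERS = ("EAX", "EBX", "ECX", "EDX", "EBP", "ESI", "EDI")  # no ESP


def addressing_mode(base=None, index=None, constant=None):
    if base is not None and base not in BASE_REGISTERS:
        raise ValueError(f"{base} cannot be used as base")
    if index is not None and index not in INDEX_REGISTERS:
        raise ValueError(f"{index} cannot be used as index")
    parts = []
    if constant is not None:
        parts.append("direct")
    if base is not None:
        parts.append("based")
    if index is not None:
        parts.append("scale-indexed")
    if not parts:
        raise ValueError("at least one component must be present")
    return " + ".join(parts)


print(addressing_mode(constant=0x401000))         # direct
print(addressing_mode(base="EBX", constant=8))    # direct + based
print(addressing_mode(base="EBX", index="ESI"))   # based + scale-indexed
```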

In the case of jump instructions, another type of addressing is present, called *relative* addressing.

Relative addressing indicates the position of the next instruction to be run relative to the current position. This "distance" is expressed as the number of bytes to jump over. The x86 architecture allows relative SHORT addresses, described on a byte and having values between -128 and 127, but also relative NEAR addresses, represented on a doubleword with values between -2147483648 and 2147483647.
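A relative jump can be sketched as adding a signed distance to a position in the code segment. The example below assumes that the distance is counted from the instruction following the jump, which is how the x86 processors apply it; the function name and the offsets used are hypothetical.

```python
RANGES = {
    "short": (-128, 127),               # distance encoded on a byte
    "near": (-2147483648, 2147483647),  # distance encoded on a doubleword
}


def jump_target(next_instruction_offset, distance, size="short"):
    low, high = RANGES[size]
    if not low <= distance <= high:
        raise ValueError(f"distance {distance} does not fit a {size} jump")
    # Assumption: the distance is added to the offset of the instruction
    # that follows the jump; the result is kept on 32 bits.
    return (next_instruction_offset + distance) & 0xFFFFFFFF


print(hex(jump_target(0x401002, -2)))                   # 0x401000
print(hex(jump_target(0x401005, 0x100, size="near")))   # 0x401105
```

A distance of 0x100 does not fit a SHORT jump, so `jump_target(0x401002, 0x100)` raises `ValueError`, while the same distance is accepted with `size="near"`.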